Multi-Level Analysis and Annotation of Arabic Corpora for Text-to-Sign Language MT

Authors

  • Abdelaziz Lakhfif
  • Mohamed Tayeb Laskri
  • Eric Atwell
Abstract

The Arabic language is morphologically rich and syntactically complex, with many differences from European languages, which makes porting existing annotation tools to Arabic a challenge. In this paper, we present an ongoing effort in lexical semantic analysis and annotation of Modern Standard Arabic (MSA) text: a semi-automatic annotation tool covering the morphological, syntactic, and semantic levels of description. Besides providing a multi-level annotation tool for Arabic corpora, our goals are (1) to investigate the suitability of the Frame Semantics (FS) approach (Fillmore 1985) for representing and analysing Arabic text; (2) to provide corpus-attested linguistic material for frame-based contrastive text analysis between Arabic and English in terms of lexicalization patterns; and (3) to automatically derive mapping rules from annotated sentences. Such corpus-attested mapping rules between a linguistic form and its meaning can support semantic analysis in knowledge-based NLP systems such as machine translation and information extraction. Following syntactically based annotation projects for English, serious attempts have been made to annotate Arabic corpora, such as the Penn Arabic Treebank (PATB) (Maamouri et al. 2004) and the Quranic Arabic Dependency Treebank (QADT) (Dukes et al. 2010). However, semantically based annotation of Arabic corpora has not yet received the same attention. Our semantic representations are based on the frame-semantic paradigm and are currently used in an MT system from Arabic to Algerian Sign Language, which aims to assist deaf children by bridging the gap between written Arabic texts and Algerian Sign Language (Lakhfif and Laskri 2010a,b, 2011). Annotation outputs are available in an XML format compatible with the design of the FrameNet project (Fillmore and Petruck 2003) and can be ported to other NLP systems.
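As a rough illustration of what a multi-level annotation with an XML output might look like, the following Python sketch builds morphological, syntactic, and frame-semantic layers for one sentence and derives a simple form-to-meaning mapping from it. Everything in it is an assumption made for illustration: the example sentence, the element and attribute names (`layer`, `token`, `label`, `frame`, `fe`), the role labels, and the helper functions `build_annotation` and `derive_mapping` are hypothetical and do not reproduce the authors' tool or the actual FrameNet XML schema. The frame name `Motion` and its frame elements `Theme` and `Goal` are borrowed from FrameNet only as a plausible assignment.

```python
import xml.etree.ElementTree as ET

# Hypothetical example sentence, roughly "The boy went to the school".
# Lemmas, POS tags, and role labels are illustrative assumptions.
TOKENS = [
    {"form": "ذهب", "lemma": "ذهب", "pos": "VERB", "role": "predicate"},
    {"form": "الولد", "lemma": "ولد", "pos": "NOUN", "role": "subject"},
    {"form": "إلى", "lemma": "إلى", "pos": "PREP", "role": "prep"},
    {"form": "المدرسة", "lemma": "مدرسة", "pos": "NOUN", "role": "prep_object"},
]

# Assumed frame assignment: the verb (token 0) evokes the frame, and each
# frame element covers an inclusive span of token indices.
FRAME = {
    "name": "Motion",
    "target": 0,
    "elements": {"Theme": (1, 1), "Goal": (2, 3)},
}


def build_annotation(tokens, frame):
    """Serialise the three annotation levels as layers of one XML sentence."""
    sentence = ET.Element("sentence")
    morph = ET.SubElement(sentence, "layer", name="morphology")
    syntax = ET.SubElement(sentence, "layer", name="syntax")
    for i, tok in enumerate(tokens):
        ET.SubElement(morph, "token", id=str(i), form=tok["form"],
                      lemma=tok["lemma"], pos=tok["pos"])
        ET.SubElement(syntax, "label", token=str(i), role=tok["role"])
    semantics = ET.SubElement(sentence, "layer", name="semantics")
    frame_el = ET.SubElement(semantics, "frame", name=frame["name"],
                             target=str(frame["target"]))
    for fe_name, (start, end) in frame["elements"].items():
        ET.SubElement(frame_el, "fe", name=fe_name,
                      start=str(start), end=str(end))
    return sentence


def derive_mapping(tokens, frame):
    """Pair each frame element with the syntactic roles of the tokens it spans."""
    return {
        fe_name: [tokens[i]["role"] for i in range(start, end + 1)]
        for fe_name, (start, end) in frame["elements"].items()
    }


if __name__ == "__main__":
    root = build_annotation(TOKENS, FRAME)
    print(ET.tostring(root, encoding="unicode"))
    print(derive_mapping(TOKENS, FRAME))
```

Keeping each level in its own layer, linked only by token indices, is one simple way to let the semantic layer be added or revised without touching the morphological and syntactic annotations; the abstract does not say whether the actual tool is organised this way.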

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, with a hyphen separating the different parts. The Persian language has multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of a half-space character. This common incorrect use of space leads to some s...

Digging into Signs: Emerging Annotation Standards for Sign Language Corpora

This paper describes the creation of annotation standards for glossing sign language corpora as part of the Digging into Signs project (2014-2015, http://www.ru.nl/sign-lang/projects/digging-signs/). This project was based on the annotation of two major sign language corpora, the BSL Corpus (British Sign Language) and the Corpus NGT (Sign Language of the Netherlands). The focus of the gloss ann...

Semiotic Analysis of Written Signs in the Road Sign Systems of Tehran City

Introduction: as a component of the urban landscape, road sign systems are among the most critical elements of urban environments. Generally speaking, the written signs dominate the design of these systems. These signs can also foster aesthetic and visual pleasure compellingly and innovatively. Furthermore, they perpetuate a specific image in the minds of their observers. This research seeks to...

MMAX: A Tool for the Annotation of Multi-modal Corpora

We present a tool for the annotation of XML-encoded multi-modal language corpora. Non-hierarchical data is supported by means of standoff annotation. We define base level and suprabase level elements and theory-independent markables for multi-modal annotation and apply them to a cospecification annotation scheme. We also describe how arbitrary annotation schemes can be represented in terms of the...

Journal title:
  • CoRR

Volume abs/1605.07346  Issue

Pages  -

Publication date 2015